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Abstract 

Background: Early detection of breast cancer in blood is both appealing clinically and challenging technically due 
to the disease's illusive nature and heterogeneity. Today, even though major breast cancer subtypes have been 
characterized, i.e., luminal A, luminal B, HER2+, and basal-like, little is known about the heterogeneity of breast 
cancer in blood, which could help to discover minimally invasive protein biomarkers with which clinical researchers 
can detect, classify, and monitor different breast cancer subtypes. 

Results: In this study, we performed an integrative pathway-assisted clustering analysis of breast cancer subtypes 
from plasma proteome samples collected from 80 patients diagnosed with breast cancer and 80 healthy women. 
First, four breast cancer subtypes and additionally unknown subtype (according to existing annotation) were 
determined based on pathology lab test results in primary tumors of enrolled patients. Next, we developed and 
applied four distance metrics, i.e., Protein Intensity, Q-Value, Pathway Profile, and Distance Score Function, to 
measure and characterize these cancer subtypes. Then, we developed a permutation test to evaluate the 
significant protein level changes in each biological pathway for each breast cancer subtype, using q-value. Lastly, 
we developed a pathway-protein matrix for each of the four distance methods to estimate the distance between 
breast cancer subtypes, for which further Pathway Association Network analysis were performed. 

Conclusions: We found that 1) the luminal group (luminal A and luminal B) are clustered together, as well as the 
basal group (basal-like and HER2+) and 2) luminal A and luminal B are more close to each other than basal-like 
and HER2+ to each other. Our results were consistent with a recent independent breast cancer research from the 
Cancer Genome Atlas Network using genomic DNA copy number arrays, DNA methylation, exome sequencing, 
messenger RNA arrays, microRNA sequencing and reverse-phase protein arrays. Our results showed that changes of 
different breast cancer subtypes at the pathway level are more profound and less variable than those at the 
molecular level. Similar subtypes share distinct yet similar pathway activation networks, while dissimilar subtypes 
are different also at the level of pathway activation networks. The results also showed that distance or similarity of 
cancer subtypes based on pathway analysis might be able to provide further insight into the intrinsic relationship 
of breast cancer subtypes. We believe integrative pathway-assisted proteomics analysis described here can become 
a model for reliable clustering or classification of other cancer subtypes. 



Background cancer, about 63,300 new cases of carcinoma in situ (CIS), 

Early detection and early intervention are keys to success- and about 39,510 breast cancer deaths for 2012. 
ful treatment of breast cancer, the second most common Functional genomics studies using DNA Microarrays or 
type of cancer after lung cancer worldwide. The American tandem mass spectrometry have been shown effective in 
Cancer Society estimated that, in the United States alone, differentiating between breast cancer tissues and normal 
there will be about 226,870 new cases of invasive breast tissues, by measuring thousands of differentially expressed 

genes or proteins simultaneously [1-3]. However, early 
detection of breast cancer in blood are both appealing 
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cancers and 2) the fact that breast cancer is not a single 
homogeneous disease but consists of multiple disease sub- 
types, each arising from a distinct molecular mechanism 
and having a distinct clinical progression path [4]. Accu- 
rate classification of breast cancer subtypes, coupled with 
early detection, is therefore critical to effective cancer 
treatment, because mechanistically homogeneous breast 
cancer subtypes are easier to be distinguished from non- 
cancer conditions or treated than the entire heterogeneous 
breast cancer group. 

Breast cancer disease heterogeneity has been confirmed 
at the gene expression level and categorized into five 
molecular subtypes: luminal A, luminal B, HER2+, basal- 
like, and normal-like, each with distinct gene expression 
patterns and prognosis [5]. For example, Perou et al. [6] 
originally showed that ER status divided breast tumors in 
two different branches, each arising from one of the two 
types of breast cancer cells: basal (ER-negative) and lumi- 
nal (ER-positive) cells. ER-positive tumors represent 34- 
66% of all breast cancers. Tumors in the ER-positive 
group have expression patterns reminiscent of the luminal 
epithelial cells of the breast. ER-negative breast cancers, 
which represent 30-45% of all breast cancer, are character- 
ized by lack of Hormone Receptor expression and low to 
absent expression of some other luminal markers. Recent 
inclusion of a larger number of samples and meta-analysis 
showed that luminal tumors can be subdivided into lumi- 
nal A and luminal B subtypes, and basal tumors into 
normal-like, basal, and HER2+[5,7]. 

Recent studies further established refined classification 
criteria among these subtypes, based on similarity mea- 
sures of their gene expression profiles [7,8]. This new clas- 
sification system has given researchers more significant 
insights into the pathogenesis and metastasis of tumors 
than conventional pathological classification methods can 
do. However, there are several limitations that restrict 
their clinical applications. First, the transcription of genes 
is the first stage of gene expression and doesn't actually 
represent the actual functional molecules in the cell. On 
the contrary, the proteome, the complete set of proteins 
produced by the genome at any one time, is much more 
complex and dynamic than either the genome or the tran- 
scriptome. And in proteomics the real functional mole- 
cules of the cell are being studied. Second, either the gene 
expression signatures or protein changes from different 
tumor samples can be highly variable [9]. Third, the over- 
lap of these signatures or changes among different data 
sets has been poor. For example, a comparison of results 
from two breast cancer treatment prognostic studies that 
led to clinical adoption and commercialization, Mamma- 
print and OncoDX, [10,11] revealed a very limited overlap 
between them, with only 3 out of 70 or 76 genes in com- 
mon. Lastly, recent studies showed that genes or proteins 



differentially expressed between the primary breast (PBT) 
and metastatic lymph node (MLN) can be quite different, 
further limiting the power of determining prognostic out- 
comes based on gene expression or protein change profiles 
of PBT alone [12,13]. 

To accurately classify breast cancer subtypes in the pre- 
sence of inherent "Omics" data noises, new pathway- 
driven data analysis approaches combined with proteomics 
have become necessary. In the past, many statistical meth- 
ods have been developed to improve identification of sub- 
sets of differentially expressed genes or proteins from 
transcriptome or proteome profiling experiments of breast 
cancer [14]. While statistical tests can point to genes or 
proteins significantly altered between different cancer 
states or subtypes, they do not readily explain the biologi- 
cal contexts of such changes. In contrast, it is becoming 
increasingly apparent that genes or proteins function 
through complex molecular interactions to each other 
[15-17]. For example, breast cancer cell growth can be dri- 
ven by mutations that lead to the constituent activation of 
the oestrogen receptor pathway [18]. It has also been dis- 
covered that different breast cancer subtypes originate 
from separate pathways [19]. However, due to lack of 
human biological pathway databases with sufficient quality 
and data coverage, integrated pathway study of breast can- 
cer subtypes, particularly for data derived from proteomics 
experiments, have not been previously reported. 

In this study, we developed an integrative pathway- 
assisted proteomics analysis method to study biological 
pathway-level changes in human breast cancer related 
plasma proteins, which we characterized from liquid-chro- 
matography coupled tandem mass spectrometry (LC-MS/ 
MS) proteomics experiments of plasma samples collected 
from patients diagnosed with breast cancer and healthy 
individuals. We used the new Human Pathway Database 
(HPD) [20] and the Integrated Pathway Analysis Database 
for Systematic Enrichment Analysis (IPAD) [21] to gain 
information on nearly one thousand human biological 
pathways, pathway protein constituents, and pathway- 
pathway similarity relationships. We labeled breast cancer 
patients with one of the four subtypes (luminal A, luminal 
B, HER2+, and basal-like) and additionally unknown sub- 
type, using currently established clinical classification cri- 
teria [7,8]. We studied the use of four measures to the 
classification of breast cancer plasma proteomic results: 
Protein Intensity, Q- Value, Pathway Profiling, and Dis- 
tance Score Function. We developed a permutation test to 
evaluate significant protein level changes in each biological 
pathway for each breast cancer subtype, using Q-value. 
Lastly, we developed a pathway-protein matrix for each of 
the four distance methods to estimate the distance 
between breast cancer subtypes and further performed 
pathway association analysis for each subtype. 
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Our results showed that each breast cancer subtype may 
be associated with changes of many different plasma pro- 
teins; however, the changes of different cancer subtypes at 
the pathway level are more profound and less variable 
than those at the molecular level. Similar subtypes share 
distinct yet similar pathway activation networks, while dis- 
similar subtypes are different also at the level of pathway 
activation networks. The results also showed that distance 
or similarity of cancer subtypes based on pathway analysis 
might be able to provide further insight into the intrinsic 
relationship of breast cancer subtypes. We believe integra- 
tive pathway-assisted proteomics analysis described here 
can become a model for reliable classification of other 
cancer subtypes. 

Methods 

Materials 

Ammonium carbonate, ammonium bicarbonate, urea, for- 
mic acid, lysozyme, 2-Iodoethanol, and triethylphosphine 
were all purchased from Sigma- Aldrich (St. Louis, MO, 
USA). Acetonitrile and MS grade water were purchased 
from Honey Burdick & Jackson (Morristown, NJ, USA). 
Trypsin was purchased from Worthington Biochemical 
Corporation (Lakewood, NJ, USA). Seppro tip IgY-12 and 
reagent kit were purchased from GenWay Biotech (San 
Diego, CA, USA). 

Human plasma samples 

Two batches of plasma samples were collected by the 
Hoosier Oncology Group (HOG) (Indianapolis, IN, USA) 
(each contained 40 plasma samples from women with 
breast cancer and 40 plasma samples from healthy age- 
matched volunteer women as control). All patients 
involved in this study were diagnosed with a stage II or 
earlier breast cancer. Most patients had previously been 
treated with chemotherapy. All samples were collected 
with the same standard operating procedure and stored in 
a central repository in Indianapolis, IN, USA. 

Proteomics methods 

Biomarker identification and characterization holds great 
promise for more precise diagnoses and for tailored 
therapies. The heterogeneity of human cancers and 
unmet medical needs in these diseases provides a compel- 
ling argument to focus biomarker development in cancer. 
Mass Spectrometry (MS)-based proteomics approaches 
have provided insight into biomarkers of cancer and 
other diseases with femtomole sensitivity and high analy- 
tical precision. 

Label-free protein identification and protein quantitative 
analysis services were performed by professionals at the 
Protein Analysis and Research Center/Proteomics Core of 
Indiana University School of Medicine, co-located at 



Monarch Life Sciences, Inc, Indianapolis. For a thorough 
review of the principle and method developed and used, 
refer to the review by Wang et al [22]. 

Proteins were prepared and subjected to LC/MS/MS 
analysis. First, all samples were run on a Surveyor HPLC 
(ThermoFinnigan) with a C18 microbore column (Zorbax 
300SBC18, 1 mm x 5 cm). Then, all tryptic peptides 
(100 uL or 20 ug) were injected onto the column in ran- 
dom order. Next, peptides were eluted with a linear gradi- 
ent from 5% to 45% acetonitrile developed over 120 min 
at a flow rate of 50 uL/min, and the eluant was introduced 
into a ThermoFinnigan LTQ linear ion-trap mass spectro- 
meter. Last, the data were collected in the "triple-play" 
mode (MS scan, Zoom scan, and MS/MS scan). The data- 
base searches was performed using both the XITandem 
and SEQUEST algorithms. 

Protein identification and quantification 

The International Protein Index (IPI) was used to map and 
identify sequence IDs. A LC/MS-based label-free protein 
quantification software licensed from Eli Lilly and Com- 
pany was used to perform the protein quantification. First, 
all extracted ion chromatograms (XIC) were aligned by 
retention time, after the raw files were acquired from the 
LTQ. Each aligned peak should match parent ion, charge 
state, daughter ions (MS/MS data) and retention time 
(within a one-minute window). If any of these parameters 
are not matched, the peak is disqualified from the quantifi- 
cation analysis. Then, after the alignment, the area-under- 
the-curve (AUC) from individually aligned peak was 
measured, normalized, and compared for their relative 
abundance using methods described in [14]. Last, we trans- 
formed all peak intensities to a log2 scale for quantile nor- 
malization. If multiple peptides have the same protein 
identification, their quantile normalized log2 intensities are 
averaged to obtain log2 protein intensities, which are fit by 
the analysis of variance (ANOVA) statistical model for each 
protein as y ijk : 

Yijk = n + Tj + S fe + U + Sijk, (1) 

where I t ~ N(0, a 2 ), S k ~ N(0, cr 2 2 ), e jk ~ N(0, a 2 ) , 
[i is the mean intensity value, 7} is the fixed group effect 
(caused by the experimental conditions or treatments 
being evaluated), Sk is the random sample effect (ran- 
dom effects from either individual biological samples or 
sample preparations), 7 ; is the random replicate effect 
(random effects from replicate injections of the same 
sample), and are the within-groups errors. All of the 
injections were in random order and the instrument was 
operated by the same operator. All random effects are 
assumed independent of each other and independent of 
the within-groups errors e^. 
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Statistics test 

Statistical Significance was measured by a three-step 
method. First, we conducted a permutation test to cal- 
culate the permutation test p value (also called false dis- 
covery rate). Then we calculated the FDR adjusted p 
value. Last, we calculated the FDR q value using the 
Storey-Tibshirani method [23]. 

We chose three significance screening filters (cutoffl 
q<0.2, cutoff2 q<0.1, and cutoff3 ^<0.05) to select pro- 
teins where we estimated significant differences in the 
healthy samples and each of the cancer subtypes. The 
False Positive Rate (FPR) or expected proportion of false 
positive among the proteins with declared changes is 
FPR = qvalue x number of the proteins with declared 
changes. 

Permutation test 

We presented a permutation test which used two-sample t- 
statistics (for equal variance) and Welch's t-Test statistics 
(for unequal variance) to calculate the p-value of null 
hypothesis H 0 : [i h = \i c (h: healthy samples; cxancer sub- 
type) instead of using a single t-statistics (two-sample t-sta- 
tistics or Welch's t-Test statistics) without considering 
sample variance in the traditional permutation test. Our 
test statistic is the difference between the mean of protein 
intensities in healthy samples and the mean of proteins 
intensities in cancer subtypes samples divided by the stan- 
dard error of the mean. The statistical significance of pro- 
tein with change was assessed by computing a permutation 
test p-value for each protein, representing the chance of 
observing a test statistic at least as large as the value actu- 
ally obtained (Figure 1). All samples across the two groups 
for each protein were permuted 100,000 times and the 
complete set of t-tests between the two groups was per- 
formed for each permutation according to equations (2) 
and (3). The permutation p-value for a particular protein is 
the proportion of the permutations in which the permuted 
test p-value doesn't exceed the observed test p-value in 
absolute value. 

After resampling, an F test is performed to compare 
the variances of two samples from normal populations. 
The F test's null hypothesis is that the variances of the 
two samples are equal. If the null hypothesis cannot be 
rejected, the 2 sample t-test with equal variance will be 
used. On the other hand, if the result of F-test indicates 
that the null hypothesis should be rejected, Welch's t- 
Test will be used. 

For equal variance, 2 sample t-test statistics is calcu- 
lated as 



Xi — X? 



1 1 

SxjXj - J — + — 



(2) 



Resampling 



No 



Fail to Reject HO 
i 




2 Sample t-Test " 
(equal variances) / 



f Welch's 
t-Test 



Adjust Permutation Test P-Value 




Final Permutation Test P-Value 



Figure 1 flowchart of permutation test 



where SjCiX2 = \ { — - 1)S ^ + ( " 2 - ^ . s XlXl is an estima- 



l%\ + n 2 - 2 

tor of the common standard deviation of the two sam- 
ples. The degree of freedom for this test is «j + n 2 -2. 

For unequal variance, Welch's t-Test statistics is cal- 
culated as 



t = 



Xi — Xi 



X 1 —X 2 



(3) 



where s- 



/ _L + _1 For use in significance test- 
Xl ~ Xl V m n 2 
ing, the distribution of the test statistic is approximated 
as being a Student's t distribution with the degrees of 
freedom calculated using 



df - 



(sf/ni +sl/n 2 ) 



(5?/nir(m-l) + (5i/n 2 )7(n 2 -l) 



(4) 



The permutation test p-value is adjusted at each 
resampling, if the permuted test p-value is less than or 
equal to the observed test p-value in absolute values, 
until the number of resampling reaches the number lim- 
itation of permutation. 

Cancer subtypes distances 
Protein intensity 

For protein intensity in a cancer subtype, a number of fac- 
tors contribute to system variability: protein effects, sub- 
type effects, individual effects and so on, plus interactions 
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of these factors. Each of these main effects and interac- 
tions can be accounted for in a linear mixed model. Linear 
mixed model is primarily used to describe relationships 
between a response variable and some covariates in data 
that are grouped according to one or more classification 
factors. Linear mixed model estimates variance compo- 
nents using the residual maximum likelihood (REML) 
approach. We used R language and environment for statis- 
tical computing and graphics, which is a GNU project 
which is similar to the S language and environment 
(http://www.r-project.org), to fit linear mixed model. 

A full factorial model was used to represent the fixed 
effect and two random effects which were used to 
account for individuals, subtypes and their interaction. 
The corresponding linear mixed model for the expres- 
sion log ratios value y ijk for each gene/protein in the /th 
subtypes of the /th individual, is 



Yijk = M + Sj + h + Sly + Syk, 



(5) 



Where /; ~ N(0, of), Sly ~ N(0, of), e tjk ~ N(0, a 2 ) . 
The fixed effects in Equation (5) are the grand mean, 
and Sj , the subtypes. The random effects in Equation 
(5) are the individual random effect, and S7 /y , the 
subtypes within individual random effect. All random 
effects were assumed independent of each other and 
independent of the within-groups errors e^. 

We defined the distance between subtype s and sub- 
type t as 



d st = 1 - Corr(S s ,S t ), 



(6) 



where Corr is the Pearson correlation coefficient. 
Q-value 

Based on the q value of protein change between healthy 
samples and cancer subtype samples, we defined the dis- 
tance between subtype s and subtype t as 



d st = 1 — Corr(qvalue s , qvalue t ), 



(7) 



where Corr is the Pearson correlation coefficient. 
Pathway profiling 

Based on the number of proteins in each pathway which 
is identified by Pathway Analysis, we defined the dis- 
tance between subtype s and subtype t as 



d st = \-Corr{N s ,Nt), 



(8) 



where Corr is the Pearson correlation coefficient, N s is 
the vector of protein numbers in subtype s, and N t is 
the vector of protein numbers in subtype t. 
Distance score function 

We provided a measure of distance or similarity 
between cancer subtypes according to pathway-protein 
profiling. The dendrogram for each cancer subtype was 
then constructed based on the Hierarchical cluster 



analysis with the inferred cancer subtype distance 
matrix. The detailed process is described as follows. 

Suppose a set of pathways in each cutoff, 
O k = {o\,02, ■ ■ ■ ,o m k], a set of proteins in each cutoff, 
B k = {b\, £>2, • • • , b n i,}, where k = 1,2,3 stands for three 
cutoffs (cutoffl q<0.2, cutoff2 q<0.l, and cutoff3 £7<0.05, 
respectively). Suppose a set of pathways in all three cut- 
offs, O = {oi, o 2 ,..., o m }, and a set of proteins in all three 
cutoffs, B = {b\, b 2 ,—, b n }, where m is the total number 
of pathways in all three cutoffs and n is the total num- 
ber of proteins in all three cutoffs. 

Pathway-protein matrices for the three cutoffs are Q 1 , 
Q 2 , and Q 3 , respectively. In each Pathway-protein 
matrix, the row stands for a list of pathways, 
the column a list of proteins. Its i, j entry 

(?|(fe = 1, 2, 3; i = 1, 2, . . . m;j =1,2, ...n) is k if the /th 
protein shows up in the /th pathway; otherwise 0. 

The final pathway-protein matrix Q = [(]ij]mxn can be 
obtained by merging the three matrices Q 1 , Q 2 , and 
Q 3 . Its element can be obtained by calculating the max- 
imum value as shown in the following equation: 



(9) 



Suppose the final pathway-protein matrix for the four 
subtypes (luminal A, luminal B, HER2+, and basal-like) 
and additionally unknown subtype is Q} , I = 1,2,3,4,5 
respectively. We define the distance between subtype s 
and subtype t as the average distance score for all path- 
way-protein pairs in Q s and Q , which is expressed as 
follows 



d st 



£ scorers, t) Efi^^j) 

i,j ij 



(10) 



m x n 



m x n 



where / stands for the distance score function, 
f{cf ii ,cf ii )=abs{cf ij -q\ j ). 

Pathway analysis 

Pathway Analysis are performed using the web tool 
HPD [24] and IP AD [21] we developed. 

Pathway association network 

The pathway similarity measure is defined as the extent 
of overlaps, e.g., common number of genes/proteins, 
shared between two different pathways [25]. The path- 
way-pathway similarity score S it ; is defined as 



IP.-nR-l 

Si,j = 77777177' = 1 ■ • ■ N,j = 1 . . . N, 



(11) 



\PiUPj\ 

where, N denotes total number of pathways. Pi and P, 
denote two different pathways, while |P,| and \Pj\ are the 
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numbers of proteins in these two pathways. Their inter- 
section Pi n Pj is the set of all proteins that appear in 
both Pi and Pp while their union P t U Pj is a set of all pro- 
teins either appearing in the P, or in the Pj. Duplicates 
are eliminated in the intersection set and union set. 

Cytoscape is used to visualize and associate pathway 
networks with pathway protein association matrix. The 
Edge line width is proportional to the similarity of the 
connected pathways. Node size is proportional to pro- 
tein numbers in a pathway. Node color is proportional 
to the number of proteins with change between samples 
of each subtype and healthy samples in a pathway. 

Prediction performance of four distance metrics 

In order to validate the pathway profiling, distance score 
function and permutation test we presented, we compare 
the true positives and accuracy of subtype prediction 
among Protein Intensity, Q-Value, Pathway Profiling 
with Traditional Permutation, Pathway Profiling, and 
Distance Score Function. 

The nearest neighbor prediction is used to predict can- 
cer subtypes with the 80 patients' data. The nearest neigh- 
bor prediction is a method that finds the closest 
(according to four different distance metrics) exemplar to 
the patient and predicts the subtype of the exemplar. 

For protein intensity, we directly use the 80 patients' 
protein intensity data. Each subtype protein intensity is 
used as an exemplar. The distance between subtype 5 and 
patient t is defined as 

d st = 1 - Corr{S s , P t ), (12) 

where Corr is the Pearson correlation coefficient. 

For Q- Value, we first compute the Q-Value for proteins 
in each patient by the statistical significance testing of the 
expression of the protein between all healthy women and 
the patient. Each subtype Q-Value is used as an exemplar. 
Pearson correlation coefficient distance measure is used to 
calculate how close each patient of the training set is to 
the five exemplars that are being examined. 

The distance between subtype s and patient t is 
defined as 

d st = 1 — Corr(cjvalue s , cfvalue t ), (13) 

where Corr is the Pearson correlation coefficient. 

For Pathway Profiling, based on the pathway-protein 
matrix for each patient and each subtype, we defined 
the distance between subtype s and patient t as 

dst = 1 - Corr{N s ,N t ), (14) 

where Corr is the Pearson correlation coefficient, N s is 
the vector of protein numbers in subtype s, and N t is the 
vector of protein numbers in patient t. Each subtype's 



vector of protein numbers in pathway-protein matrix is 
used as an exemplar. 

For Distance Score Function, the final pathway-protein 
matrix of each subtype is used as an exemplar. Similarly, 
we define the distance between subtype s and patient t 
as the average distance score for all pathway-protein 
pairs in Q s and Q , which is expressed as follows 

£ scorers, t) 

a _a g d5) 

m x n m x n 

where / stands for the distance score function, 
fitfij'llj) = ^Wij — m is the number of pathways, 
and n is the number of proteins. 

The prediction process of Pathway Profiling with tradi- 
tional Permutation is the same as that of Pathway Profiling 
except that the former uses traditional permutation and 
the latter uses the permutation test we presented. 

Results 

Plasma proteomics results for breast cancer cases and 
controls 

The proteomics experiment included 160 plasma sam- 
ples, 80 samples from women with breast cancer and 80 
from healthy volunteer women which serve as controls. 
The plasma proteome set have 616 proteins which were 
mapped to 1458 UniprotlD. Those proteins with no 
match in the Uniprot database were excluded from the 
pathway analysis. 

Histopathological data on clinical samples 

Receptor status based on immunohistochemical (IHC) 
expression of the estrogen receptor (ER) or progesterone 
receptor (PR) and human epidermal growth factor recep- 
tor 2 (HER2) proteins was used to approximate subtype: 
ER+ or PR+ and HER2- (luminal A); ER- or PR- and 
HER2+ (luminal B); ER- and PR- and HER2+ (HER2+); 
and ER- and PR- and HER2- (basal-like). We have a total 
of 68 plasma samples from four different subtypes of 
breast cancer patients and the other 12 patient samples we 
called unknown subtype in which receptor statuses were 
missing. The number of patients, age, tumor, and grade, in 
each subtype, are shown in Table 1. 

Dendrograms of breast cancer subtypes 

First we calculated the breast cancer subtypes distance 
based on Protein Intensity of 80 cancer samples. Each 
subtype's fixed effect was estimated by the Equation (5) 
and then the distance between two subtypes were 
obtained by the Equation (6). 

Then, we calculated the breast cancer subtypes distance 
based on Q-Value. The Q-Value of protein change 
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Table 1 clinical data summary for breast cancer patients 



Subtype Receptor Status 



Number 



Age 



Cancer Type 



Grade 



Metastasis 



Tumor Size 



LuminalA 



LuminalB 



Unknown 



ER+ I PR+ & HER2- 



ER+ I PR+ & HER2+ 



HER2Plus ER- & PR- & HER2+ 



BasalLike ER- & PR- & HER2- 



Missing 



24 



15 



10 



19 



12 



1- <40 
20-[40,65] 

3- >65 

4- <40 
9-[40,65] 

2- >65 

2- <40 

8- [40,65] 
0->65 

0- <40 
18-[40,65] 

1- >65 

1- <40 

9- [40,65] 

2- >65 



19 - INV 
4 - DCIS 
1 - u 

9 - INV 
6 - DCIS 



8 - INV 
2 - DCIS 

15 - INV 
4 - DCIS 



2 - INV 
2 - DCIS 
8 - u 



7 - G1 
10 - G2 
5 - G3 

2 - u 
G1 
G2 
G3 
- u 



0 - G1 

0 - G2 
10 -G3 

1 - G1 
3 - G2 
14 -G3 

1 - u 
1 - G1 
1 - G2 
0 - G3 
10 - u 



2 locally Recurrent 

3 Distant Metastasis 

18 No 

1 locally Recurrent 

2 Distant Metastasis 

12 No 

0 locally Recurrent 

4 Distant Metastasis 

6 No 

3 locally Recurrent 
3 Distant Metastasis 

13 No 

0 locally Recurrent 
2 Distant Metastasis 
1 No 



Min = 0.5 
Max = 8 
Mean = 2.550526 

Min = 0.2 
max = 5.4 
mean = 2.205556 

min = 0.9 

max = 1 1 
mean = 3.8 

min = 0.7 

max = 6 
mean = 2.5 



INV: invasive; DCIS: Ductal Carcinoma in Situ; u: unknown 



between healthy samples and cancer subtype samples 
were obtained by a three-step method mentioned in the 
statistics test section: 1) permutation test, 2) FDR adjusted 
p-value and 3) FDR Q- Value. The distance between two 
subtypes were obtained by the Equation (7). 

Next, we calculated the breast cancer subtypes distance 
based on Pathway Profiling. Three cutoffs (cutoffl q<0.2, 
cutoff2 q<0.1, and cutof£3 <7<0.05) were chosen to identify 
proteins with significant changes in the healthy samples 
and each of the breast cancer subtypes. For example, the 
numbers of proteins with significant changes identified by 
cutoffl were 780, 829, 818, 804, and 440, respectively, for 
luminal A, luminal B, HER2+, basal-like, and unknown. 

The updated KEGG human pathway database consists 
of 215 pathways and 4955 gene IDs which we mapped to 
22532 UniProt ID. We integrated the pathway database to 
our Oracle supported biolOG2 server and used the path- 
way web tool HPD [24] and IP AD [21] we developed to 
perform pathway analysis. 

We got three pathway-protein matrices for each sub- 
type, where the row is pathway and the column is protein 
biomarkers. Up-arrow stands for up-regulated, and down- 
arrow for down-regulated. The final pathway protein 
matrix after merging the three matrices is shown in Addi- 
tional File 1. The distance between two subtypes were 
then obtained by the Equation (8). 

Lastly, we calculated the breast cancer subtypes dis- 
tance based on Distance Score Function. The Cancer 
Subtype Distance Matrix was obtained from Equation 
(10) (as shown in Table 2). 



We tested with the four distance measurement meth- 
ods (Figures 2, 3, 4, 5). In the dendrogram of subtypes 
in Protein Intensity profiling (Figure 2), basal-like, lumi- 
nal A and luminal B together are clustered together. 
The distance matrices based on Q-Value across four 
subtypes and additionally unknown subtype clearly dis- 
tinguish between the luminal group (luminal A and 
luminal B) and the basal group (basal-like and HER2+) 
(Figure 3). Dendrograms in both Figure 4 and 5 show 
that 1) the luminal group (luminal A and luminal B) are 
clustered together, as well as the basal group (basal-like 
and HER2+), 2) the luminal group and the basal group 
are more close to each other than to unknown subtype, 
and 3) luminal A and luminal B are more close to each 
other than basal-like and HER2+ to each other(Figure 4, 
Figure 5 and Figure 6a). 

Evaluation 

Our validations of the dendrogram at the pathway-level 
are three-folds. First we tested the prediction perfor- 
mance of four distance metrics. A distance measuring 



Table 2 cancer subtype distance matrix 


Distance LuminalA 


LuminalB 


HER2Plus 


BasalLike 


Unknown 


LuminalA 0 


151 


573 


589 


1190 


LuminalB 151 


0 


544 


598 


713 


HER2Plus 573 


544 


0 


302 


990 


BasalLike 589 


598 


302 


0 


1011 


Unknown 1190 


713 


990 


1011 


0 
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Figure 2 heatmap of protein change across four subtypes (luminal a, luminal b, her2+, and basal-like) and additionally unknown 
subtype. The y axis is protein markers. Each colored cell represents a protein intensity calculated by the Equation (5). The rows use hierarchical 
clustering with Euclidean distance. The columns use hierarchical clustering with protein intensity distance defined in the Equation (6). 



method with an accurate performance prediction can be 
used as a good clustering method. Then Literature cura- 
tion was used to evaluate if our result is consistent with 
previous reports. Last, Pathway Association Network 



the pathway-assisted clustering, which in turn verified 
the feasibility of the method and the validity of the 
results. 

In our first evaluation, we validated the Pathway Profil- 



analysis was performed to mine the biological meaning of ing and Distance Score Function metrics method and 
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Figure 3 heatmap of q value across four subtypes (luminal A, 
luminal B, HER2+, and basal-like) and additionally unknown 
subtype. The y axis is protein markers. Each colored cell represents 
a Q value. The rows use hierarchical clustering with Euclidean 
distance. The columns use hierarchical clustering with Q-Value 
distance defined in the Equation (7). 



permutation test we presented by comparing among the 
four distance metrics: Protein Intensity, Q- Value, Path- 
way Profiling, Distance Score Function, and Pathway Pro- 
filing with Traditional Permutation (Figure 7 and 
Additional File 2). The Distance measuring based on 
Pathway Profiling and Distance Score Function show 
fewer incorrect predictions and markedly improved accu- 
racy (53 (66%) and 48 (71%) true positives for all five sub- 
types and all four known subtypes in Pathway Profiling; 
56 (70%) and 51 (75%) true positives for all five subtypes 
and all four known subtypes in Distance Score Function). 
The pathway profiling distance measuring based on our 
permutation test shows improved accuracy compared to 
the pathway profiling distance measuring based on tradi- 
tional permutation test (41 (51%) and 36 (53%) true posi- 
tives for all five subtypes and all four known subtypes in 
Distance Score Function; 53 (66%) and 48 (71%) true 
positives for all five subtypes and all four known subtypes 
in Pathway Profiling). 

In the literature curation, we found our results (Figure 
4, Figure 5, and Figure 6a) are consistent with previous 
findings. For example, the Cancer Genome Atlas Net- 
work analyzed primary breast cancers by genomic DNA 
copy number arrays, DNA methylation, exome sequen- 
cing, messenger RNA arrays, microRNA sequencing and 




Figure 4 pathway profiling across four subtypes (luminal A, 
luminal B, HER2+, and basal-like) and additionally unknown 
subtype(q-value < 0.1). The y axis is pathways. Each colored cell 
represents number of protein in a pathway. The rows use 
hierarchical clustering with Euclidean distance. The columns use 
hierarchical clustering with Pathway Profiling distance defined in the 
Equation (8). 



reverse-phase protein arrays [26] . Their results from the 
genomic, clinical and proteomic features of breast cancer 
subtypes have established a high reliability compared to 
previous clustering results based singly on gene expres- 
sion values. Based on the summary of the genomic, clini- 
cal and proteomic features of subtypes, which include the 
percentage of ER+/HER2-, HER2+, TNBCs, TP53 path- 
way, PIK3CA/PTEN pathway, RBI pathway and DNA 
mutations, we performed hierarchical cluster analysis 
using the Euclidean distance metric (Figure 8). In the Fig- 
ure 8, the luminal group (luminal A and luminal B) are 
grouped together and the basal group (basal-like and 
HER2+) are grouped together too, and luminal A and 
luminal B are more close to each other than basal-like 
and HER2+ to each other, which are consistent with our 
results in Figure 4, Figure 5 and Figure 6a. 

Our results are also in accordance with some clustering 
analysis based on gene expression [27-29]. But clustering 
based on gene expression had diverse results. For exam- 
ple, S0rlie et al found that the luminal group and basal 
group were separated in 78 breast cancer tissue samples 
[27] and that Luminal B was categorized into basal group 
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Figure 5 hierarchical clustering of breast cancer subtypes. The distances were computed by the Equation (10) according to the Distance 
Score Function. 



in 115 malignant breast tumors [30]. Marcotte et al. iden- 
tified that HER2+ was more close to Luminal group than 
to basal-like [31]. The low reproducibility of the subtypes' 
distance measuring from the expression of individual 
genes in microarray experiments has led to the sugges- 
tion that experiments be analyzed in mass spectrometry 
and in terms of gene functions and pathways, in order to 
enhance the robustness of the results. 

Lastly, we performed the Pathway Association Network 
Analysis to understand the intrinsic relationship between 
the subtypes of breast cancers which in turn validated 
our results of dendrogram based on Pathway Profiling 
and Distance Score Function. 

The top 22 pathways were selected from the pathway 
protein matrices in the Additional File 1 with average pro- 
tein number >5 (Figure 6a). In the Figure 4, pathway pro- 
filing with all pathways shows luminal A and luminal B 
hold together just a little bit tighter than basal-like and 
HER2+. After we filtered out those pathways with average 
protein number >5, we obtained a similar dendrogram 
(Figure 6a) to the dendrogram (Figure 5) we obtained 
using Distance Score Function, where luminal A and lumi- 
nal B are more close to each other than basal-like and 
HER2+ to each other. This might suggest that functionally 
important pathways contribute more to the intrinsic rela- 
tionship between the subtypes of breast cancer. 

We then used the top 22 pathways to build a Pathway 
Association Network for each subtype as shown in Fig- 
ures 6b-6f. The increasing red depth of nodes across the 



four subtypes (luminal A, luminal B, HER2+, and basal- 
like) and the unknown typing indicates the activated 
pathways due to tumor progression. The diagram of the 
Pathway Association Network provides visual cues about 
how the network changes with development of subtype. 
It shows that the changes of different breast cancer sub- 
types at the pathway level are more relatively conserved 
although each breast cancer subtype may be associated 
with change of many different proteins (Additional 
File 1). For example, 1) 15 out of the 22 pathways are in 
common across luminal A, luminal B, basal-like, and 
HER2+ (neuroactive ligand-receptor interaction, reninan- 
giotensin system, huntington's disease, chemokine signal- 
ing pathway, alzheimer's disease, cell adhesion molecules, 
olfactory transduction, endocytosis, calcium signaling 
pathway, toll-like receptor signaling pathway, oxidative 
phosphorylation, ppar signaling pathway, vascular 
smooth muscle contraction, parkinson's disease, thyroid 
cancer); 2) an immune system pathway (complement and 
coagulation cascades), an immune disorders pathway 
(systemic lupus erythematosus), and a neurodegenerative 
diseases pathway (prion diseases) are more activated 
from the basal group to the luminal group, but the 
change of activated pathways within either the basal 
group or the luminal group is relatively conserved if com- 
pared with those change between the two groups. Lumi- 
nal A and luminal B share similar network structure 
although focal adhesion pathway and regulation of actin 
cytoskeleton pathway can distinguish luminal A from 
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a) Ileatmap of Protein Number across the four breast 
cancer subtypes (luminal A, luminal B, HER2+, and 
basal-like) and additionally unknown subtyped 

Figure 6 pathway association network. Edge line width is proportional to the connected pathways similarity. Node size is proportional to protein 
numbers in a pathway. Node color is proportional to the number of proteins with change between each subtype and control in a pathway. 



luminal B. And basal-like and HER2+ share similar net- 
work structure, although complement and coagulation 
cascades pathway and systemic lupus erythematosus can 
distinguish basal-like and HER2+. The relationship of 



breast cancer subtypes shown in (Figures 6b-6f) are in 
consistent with the intrinsic relationships in the dendro- 
grams we drew based on Pathway Profiling and Distance 
Score Function. 
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Distance Score Function 



Discussion 

Distances between breast cancer subtypes 

Dendrogram linking gene/protein or subtype samples can 
be generated to show degrees of similarity between breast 
cancer subtypes. The dendrogram of subtypes in Protein 
Intensity profiling (Figure 2) grouped basal-like, luminal A 
and luminal B together. The distance matrix based on Q 
value across the four known subtypes produced similar 
hierarchical clustering with previously reported works 
from degree of similarity of gene expression profile 
between subtypes[5]. Basal-like should not be classified as 
Luminal tumor, because most of basal-like and HER2+ are 
significantly more likely to be grade III than the luminal A 
tumors [5,7,8]. Compared to the luminal A which appears 
to be associated with the best prognosis, the HER2 and 
basal-like have poor prognostic feature as defined by rou- 
tine pathology methods [5,7,30,32]. The possible reason is 
that the degree of similarity between subtypes based on 
gene expression value or protein intensity might not cor- 
rectly indicate the relationship between subtypes, (they 
may indicate well the relationship between genes or pro- 
teins). Although useful relationships between co-expressed 
genes or similar protein change patterns may be discov- 
ered from the variance of gene expression or protein 
intensity, genes with similar expression patterns or pro- 
teins with similar intensity change don't always mean they 
have similar functions or play roles in similar pathways; 
therefore, to some extent, they will miscategorize samples 
(subtypes). 

Pathway Profiling and Distance Score Function both 
outperform the traditional gene expression or protein 



change profiling in measuring the distance or similarity 
of subtypes. Dendrograms in Figures 4 and 5 indicate: 1) 
the luminal group (luminal A and luminal B) are grouped 
together, as well as the basal group (basal-like and HER2 
+), 2) the luminal group and the basal group are more 
close to each other than to unknown subtype, and 3) 
luminal A and luminal B are more close to each other 
than basal-like and HER2+ to each other(Figure 4, Figure 
5 and Figure 6a). As mentioned in evaluation section, 
these are consistent with the previous findings [26-29]. 

We believe the dendrograms we drew in Figures 5 and 
6a relatively accurately reflects the intrinsic relationship 
between the four breast cancer subtypes. For example, the 
survival analysis of breast cancer subtypes found that the 
breast cancer subtypes also differed significantly in breast 
cancer specific survival: basal-like (75%), HER2+(52%), 
luminal A (84%), and luminal B (87%) [7]. Relatively high 
Kaplan-Meier survival curves were observed in Luminal A 
and Luminal B, and relatively steep falls in breast cancer 
specific survival were observed in the first 4 to 5 years for 
the basal-like and HER2+ and Basal-like [7]. 

Our results suggest that the distance or similarity of can- 
cer subtypes based on pathway analysis might be able to 
filter the noise in gene expression variance or protein 
intensity change; therefore, focusing on function and inter- 
action between them might provide further insight into 
the intrinsic relationship of cancer subtypes. Our results 
also revealed that intrinsic trait combination specific path- 
way changes may influence tumor progression and be 
helpful for early detection and early therapeutic inter- 
vention. 
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There are several potential limitations to this study. 
First, classifications based on ER PR and HER2 status are 
only approximations of the molecular breast cancer sub- 
types. Conclusions based on the receptor-based approxi- 
mations cannot necessarily be applied to the molecular 
subtypes. For example, Carey et al. [7] reported that this 
definition for luminal B does not identify all luminal B 
tumors because only 30% to 50% are HER2 positive, so 
our approximation of the luminal B group as ER or PR 
positive and HER2 positive may miscategorize a propor- 
tion of the true luminal B group as luminal A. Another 
possible limitation to the study relates to the relatively 



small numbers in some subtypes, particularly the luminal 
B group and HER2+ group, which contained only 15 and 
10 patients, respectively, compared to luminal A with 24 
patients. However, with the development of our ongoing 
CPTAC, more patient samples will be collected. And in 
the future, as the technology of molecular markers 
improves, more sophisticated markers than ER, PR, and 
HER2 immunophenotype may become available. Third, 
there is Unknown subtype instead of normal-like subtype 
in our data. Although we guess it might possibly be nor- 
mal-like subtype, it cannot be treated as normal like sub- 
type. Figure 8 shows the improved prediction accuracy 



-BasalLike 



HER2Plus 



-LuminalB 



-LuminalA 



Figure 8 dendrogram of primary breast cancers from the cancer genome atlas network The summary of the genomic, clinical and 
proteomic features of subtypes from the Cancer Genome Atlas Network, which includes the percentage of ER+/HER2-, HER2+, TNBCs, TP53 
pathway, PIK3CA/PTEN pathway, RB1 pathway and DNA mutations, was used to draw the phylogenetic tree with hierarchical cluster analysis 
under Euclidean distance metric. 
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after we remove the Unknown subtype. Correctly catalo- 
ging those unknown subtypes will help to improve dis- 
tance measuring and prediction of cancer subtype. 

Breast cancer and other disease pathway 

Breast cancer genes, as genes of a complex disease, not 
only suffer from the disturbance of its local sub-network, 
but also share some common genetic origin with other 
related diseases. For example, Gol et al. suggested by the 
cancer sub-network analysis that many cancer pheno- 
types share common genetic origins with other diseases 
[33]. 

Further we investigated the breast cancer-related dis- 
ease pathways including complement system, pathways 
in cancer, thyroid cancer, systemic lupus erythematosus, 
huntington's disease, prion diseases, alzheimer's disease, 
parkinson's disease, and oxidative phosphorylation (addi- 
tional file 1). 

The complement system consists of a series of about 25 
proteins that work to "complement" the work of antibo- 
dies in destroying bacteria. The complement system has 
important protective functions in autoimmune system, 
but can also, when inappropriately activated, cause tissue 
damage. Deficiencies in the early pathway components 
CI, C2 and C4 predispose the development of autoim- 
mune disease such as SLE and SLE-like disorders. In 
recent years, it has become evident that complement acti- 
vation is involved in the tumor cytotoxicity [34]. Adminis- 
tration of murine monoclonal antibodies against various 
tumor antigens in patients leads to antitumor effects 
including tumor regression and the localizaiton of both 
C4 and C3 at the tumor site [34]. The ability of murine 
monoclonal antibodies to mediate antibody-dependent 
cellular cytotoxicity (ADCC) with human effector cells 
and cytotoxicity mediated by complement activation 
makes these antibodyies promising candidates for cancer 
therapy [35]. 

In the pathways in cancer, TGFA and ERBB2 are both 
important regulators of normal mammary gland physiol- 
ogy, and aberrations in their signaling have been asso- 
ciated with breast tumorigenesis [36]. ERBB2 in breast 
cancer have been approved for clinical use. And the 
TGFR are potentially amenable to therapies for treatment 
of human breast disease [37]. In the Huntington's disease, 
parkingson's disease, oxidative phosphorylation, and alz- 
heimer's disease pathways, CLTC is involved in inflam- 
matory myofibroblastic tumors [38] and lower expression 
of MAPT is associated with HER2 overexpression [39]. 

Permutation test 

Most protein identification methods were based on fold 
change. For example, a fold change or two samples Stu- 
dent's original t test was carried out by comparing phy- 
siological changes between normal and disease states to 



identify serum biomarkers to detect breast cancer [40]. 
As we know, a fold change method doesn't take the 
variability of a protein into account and a t-test requires 
an assumption of normal distribution of data. However, 
the datasets we used didn't show normal distribution. 
The log2 transformated intensity values for all 1458 pro- 
teins from healthy women were not from a normal dis- 
tribution (One-sample Kolmogorov-Smirnov test, D = 
0.0419, p-value < 2.2e-16). We also found the intensity 
values from the four breast cancer subtypes and addi- 
tionally unknown subtype were not from a normal dis- 
tribution either. 

T-test is a parametric test and the permutation process 
is non-parametric. By using permutation test we made 
no assumption about the distribution under the null 
hypothesis. Usually, the assumptions in the null hypoth- 
esis are weakened, and it becomes harder to reject. 

In addition to validation using a normal quantile plot, if 
using Student's original definition of the t-test, the two 
populations being compared should have the same var- 
iance. If the sample sizes in the two groups being com- 
pared are not equal, Student's original t-test is not robust 
to the presence of unequal variances [41]. Welch's t-test 
has been used by most statistics packages such as t.test 
function in R when the two sample variance is assumed to 
be different because it is insensitive to equality of the var- 
iances regardless of whether the sample sizes are similar. 
However, if we have no good reason to believe that the 
population variances are unequal, the result of Student's 
Original t-test becomes more reliable than that of Welch's 
t-test. For example, suppose two random samples, (30.02, 
29.99, 30.11, 29.97, 30.01, 29.99) and (29.89, 29.93, 29.72, 
29.98, 30.02, 29.98) (F test statistics = 0.2122, p-value = 
0.1141 > 0.05. We cannot reject the null hypothesis that 
true difference in variances is equal to 0), the result of Stu- 
dent's original t-test (statistics = 1.959, p-value = 0.078) 
becomes more suggestive of a difference in the mean for 
the two populations of samples than that of Welch's t-test 
(statistics = 1.959, p-value = 0.091). 

During the data permutation, even if the two samples 
could originally have come from the same population, 
the variance difference between the two permuted sam- 
ples could change at each resampling. Using only one 
type of t-test invariably during the permutation process 
will obviously result in inaccurate statistics of signifi- 
cance. For example, the ttperm function in Category 
package of R language uses only Welch's t-tests to per- 
form each permutation. 

Using the q value change based on the permutation test 
p value, breast cancer was classified into four breast can- 
cer subtypes and one unknown subtype with HER2+ and 
basal-like grouped together (Figure 3), whereas with the 
traditional permutation test, basal-like was classified as 
Luminal group, same as the dendrogram in the protein 
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change (Figure 2). Our permutation test method is highly 
robust to the equality of the variances regardless of 
whether the same sizes are similar and carries more con- 
viction than the other permutation test, which doesn't 
consider the effect of equality of variances. 

Conclusion 

We report for the first time the pathway-assisted cluster- 
ing of breast cancer plasma samples, using LC-MS/MS 
proteomics results. Even though proteomics experiment 
suffer from a general perception of being noisy and 
highly variable, we show that with proper bioinformatics 
integration of breast cancer biological context, it is possi- 
ble to achieve accurate and sensitive breast cancer sub- 
type classifications. We believe pathway analysis 
performed at the level similar to ours, which include 
both intra-pathway and inter-pathway analysis, is key to 
overcoming noises in the data. Our results also show that 
proteomic pathway-assisted clustering of breast cancer 
subtypes can provide biological insight into the intrinsic 
mechanisms and relationships between different breast 
cancer subtypes. This insight may help researchers 
develop diagnostic solutions and customized treatment 
plans, all based on blood sample. We believe integrative 
pathway-assisted proteomics analysis described here can 
become a model for reliable classification of other cancer 
subtypes and can be used for mining information hidden 
both within a pathway and between pathways for all 
cancers. 

Additional material 



Additional file 1: pathway protein matrix. 

Additional file 2: comparing between the four distance metrics 
(protein intensity, q-value, pathway profiling, and distance score 
function) and pathway profiling with traditional permutation 
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